Given a set of images and their corresponding paragraph captions, a challenging task is to learn how to produce semantically coherent paragraphs that describe the visual content of the images. Inspired by recent successes in incorporating semantic topics into this task, this paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework, which couples a visual extractor with a deep topic model to guide the learning of a language model. To capture the correlations between images and text at multiple levels of abstraction and to learn semantic topics from images, we design a variational inference network to build the mapping from image features to textual captions. To guide paragraph generation, the learned hierarchical topics and visual features are integrated into the language model, including Long Short-Term Memory (LSTM) and Transformer variants, and jointly optimized. Experiments on public datasets demonstrate that the proposed models, which are competitive with many state-of-the-art approaches on standard evaluation metrics, can be used both to distill interpretable multi-layer semantic topics and to generate diverse and coherent captions. We release our code at https://github.com/dandanguo1993/vtcm aseal-image-image-paragraph-caption.git
translated by Google Translate
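The topic-guided generation described above can be illustrated with a minimal numpy sketch (not the paper's VTCM implementation; all shapes and variable names here are illustrative): per-image topic proportions inferred from visual features are mixed with per-topic word distributions to give a topic-guided prior over caption words, toward which the language model can be biased.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_topics = 6, 3

# phi:   per-topic word distributions learned from captions, shape (K, V)
# theta: per-image topic proportions inferred from visual features, shape (K,)
phi = rng.dirichlet(np.ones(vocab_size), size=num_topics)
theta = np.array([0.7, 0.2, 0.1])

# Mixing over topics yields a valid probability distribution over caption
# words that reflects the image's inferred (hierarchical) topics.
word_prior = theta @ phi
assert np.isclose(word_prior.sum(), 1.0)
```

In the paper's framework this kind of topic information is injected into the LSTM/Transformer decoder and optimized jointly, rather than used as a fixed prior.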
Most graph-to-text works are built on the encoder-decoder framework with a cross-attention mechanism. Recent studies have shown that explicitly modeling the input graph structure can significantly improve performance. However, a vanilla structural encoder cannot capture all the specialized information in a single forward pass for all decoding steps, resulting in inaccurate semantic representations. Meanwhile, the input graph is flattened into an unordered sequence in cross-attention, ignoring the original graph structure. As a result, the input-graph context vectors obtained in the decoder may be flawed. To address these issues, we propose a Structure-Aware Cross-Attention (SACA) mechanism to re-encode the input graph representation, conditioned on the newly generated context, in a structure-aware manner at each decoding step. We further adapt SACA and introduce its variant, a Dynamic Graph Pruning (DGP) mechanism, to dynamically drop irrelevant nodes during decoding. We achieve new state-of-the-art results on two graph-to-text datasets, LDC2020T02 and ENT-DESC, with only a minor increase in computational cost.
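The core idea of attending over the graph in a structure-aware way, rather than over a flattened unordered sequence, can be sketched as follows (a toy illustration under assumed shapes, not the paper's SACA implementation): attention scores are masked by graph reachability before the softmax, and a DGP-style step then drops low-weight nodes.

```python
import numpy as np

def structure_aware_attention(query, keys, reachable):
    # Attend only over nodes permitted by the graph structure; the rest are
    # masked to -inf before normalization, unlike vanilla cross-attention
    # that treats the flattened graph as an unordered sequence.
    scores = keys @ query / np.sqrt(keys.shape[1])
    scores = np.where(reachable, scores, -np.inf)
    weights = np.exp(scores - scores[reachable].max())  # exp(-inf) -> 0
    return weights / weights.sum()

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 4))    # 5 graph nodes, feature dim 4
query = rng.normal(size=4)        # decoder state at the current step
reachable = np.array([True, True, False, True, False])

w = structure_aware_attention(query, keys, reachable)
assert np.isclose(w.sum(), 1.0) and w[2] == 0.0

# Dynamic-graph-pruning flavor: nodes whose weight falls below a threshold
# are dropped so later decoding steps ignore them entirely.
pruned = reachable & (w >= 0.05)
```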
Text-to-SQL parsing is an essential and challenging task. Its goal is to convert a natural language (NL) question into the corresponding structured query language (SQL) query, based on the evidence provided by a relational database. Early text-to-SQL parsing systems from the database community achieved notable progress, at the cost of heavy human engineering and user interaction with the system. In recent years, deep neural networks have significantly advanced this task with neural generation models, which automatically learn a mapping function from the input NL question to the output SQL query. Subsequently, large pre-trained language models have taken the state of the art of text-to-SQL parsing to a new level. In this survey, we present a comprehensive review of deep learning approaches for text-to-SQL parsing. First, we introduce text-to-SQL parsing corpora, which can be categorized as single-turn and multi-turn. Second, we provide a systematic overview of pre-trained language models and existing text-to-SQL parsing methods. Third, we present the challenges faced by text-to-SQL parsing and explore some potential future directions in this field.
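To make the task concrete, here is a self-contained example of the input/output pair a text-to-SQL parser learns to produce (the toy schema, question, and query are illustrative, not from any of the surveyed corpora):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer(name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Alice", 30), ("Bob", 25), ("Carol", 41)])

# NL question: "How many singers are older than 28?"
# A text-to-SQL parser maps the question, grounded in the schema, to:
sql = "SELECT COUNT(*) FROM singer WHERE age > 28"
(count,) = conn.execute(sql).fetchone()
assert count == 2
```

Single-turn corpora pair one question with one query as above; multi-turn corpora thread a conversation of follow-up questions over the same database.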
Recently, pre-trained models have significantly improved the performance of various NLP tasks by leveraging large-scale text corpora to enhance the contextual representation ability of neural networks. Large pre-trained language models have also been applied to the area of table semantic parsing. However, existing pre-training approaches have not carefully explored the explicit interaction relationships between a question and the corresponding database schema, which is a key ingredient for revealing their semantic and structural correspondence. Furthermore, question-aware representation learning in the schema-grounded context has received less attention in pre-training objectives. To alleviate these issues, this paper designs two novel pre-training objectives that impose the desired inductive bias on the learned representations for table pre-training. We further propose a schema-aware curriculum learning approach to mitigate the impact of noise and learn from the pre-training data in an easy-to-hard fashion. We evaluate our pre-trained framework by fine-tuning it on two benchmarks, Spider and SQUALL. The results demonstrate the effectiveness of our pre-training objectives and curriculum compared with a variety of baselines.
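The easy-to-hard ordering at the heart of curriculum learning can be sketched in a few lines (the example names and difficulty scores are illustrative, not the paper's API): each pre-training example carries a difficulty/noise score, and training consumes easy examples before hard or noisy ones.

```python
# Each pre-training example is scored for difficulty (higher = harder or
# noisier); a curriculum simply orders training so easy examples come first.
examples = {"q1": 0.2, "q2": 0.5, "q3": 0.9}
curriculum = sorted(examples, key=examples.get)
assert curriculum == ["q1", "q2", "q3"]
```

In the schema-aware variant described above, the difficulty score would itself be derived from how well the question aligns with the database schema.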
English research articles (RAs) are an essential genre in academia, and consequently attempts to employ NLP to assist the development of academic writing have received considerable attention over the past two decades. However, no study has employed feature engineering techniques to investigate the linguistic features of RAs of different academic impact (i.e., papers with high/moderate citation counts published in journals with high/moderate impact factors). This study attempts to extract the micro-level linguistic features of high- and moderate-impact journal RAs using feature engineering methods. We extracted 25 highly relevant features from English journal articles via feature selection methods. All the papers dealt with COVID-19 medical empirical research. The selected features were then validated for consistency and accuracy through supervised machine learning methods. The results showed that 24 linguistic features, such as content-word overlap between adjacent sentences and the use of third-person pronouns, auxiliary verbs, tense, and emotional words, provide consistent and accurate predictions for journal articles of different academic impact. Finally, a random forest model proved to be the best model for fitting the relationship between these 24 features and journal articles of high and moderate impact. These findings can be used to inform academic writing courses and to lay the foundation for developing automatic evaluation systems for L2 graduate students.
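A minimal sketch of the kind of feature selection step described above (synthetic data, not the study's corpus or its actual selection method): rank candidate features by their absolute correlation with the target and keep the most relevant ones.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                 # 100 articles, 5 candidate features
# Synthetic target depending strongly on features 0 and 3 only.
y = 2 * X[:, 0] + X[:, 3] + rng.normal(scale=0.1, size=100)

# Rank features by absolute Pearson correlation with the target and keep
# the top two; a real pipeline would validate the selection with a
# supervised model (e.g., a random forest) afterwards.
corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
selected = sorted(np.argsort(corr)[-2:].tolist())
assert selected == [0, 3]
```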
Text-to-SQL aims to map natural language questions to SQL queries. Sketch-based approaches, combined with an execution-guided (EG) decoding strategy, have shown strong performance on the WikiSQL benchmark. However, execution-guided decoding relies on database execution, which significantly slows down the inference process and is therefore unsatisfactory for many real-world applications. In this paper, we present the Schema Dependency guided multi-task text-to-SQL model (SDSQL) to guide the network to effectively capture the interactions between questions and schemas. The proposed model outperforms all existing methods in both settings, with and without EG. We show that schema dependency learning partially covers the benefits of EG and alleviates the need for it. SDSQL without EG significantly reduces time consumption during inference, sacrifices only a small amount of performance, and provides more flexibility for downstream applications.
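To see why EG decoding is slow, here is a toy sketch of what it does (illustrative schema and candidates, not SDSQL's or any WikiSQL system's code): candidate queries are executed against the database, and candidates that error out or return empty results are discarded. Every candidate costs a database round trip, which is exactly the overhead SDSQL aims to avoid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(city TEXT, pop INTEGER)")
conn.execute("INSERT INTO t VALUES ('Oslo', 700000)")

# Execution-guided decoding: take the highest-scoring candidate query that
# executes without error and returns a non-empty result.
candidates = [
    "SELECT pop FROM t WHERE name = 'Oslo'",   # wrong column -> execution error
    "SELECT pop FROM t WHERE city = 'Oslo'",   # valid -> chosen
]
chosen = None
for sql in candidates:
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.OperationalError:
        continue
    if rows:
        chosen = sql
        break
assert chosen == "SELECT pop FROM t WHERE city = 'Oslo'"
```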
In this chapter, we review and discuss the transformation of AI technology in HCI/UX work and assess how AI technology will change how we do the work. We first discuss how AI can be used to enhance the result of user research and design evaluation. We then discuss how AI technology can be used to enhance HCI/UX design. Finally, we discuss how AI-enabled capabilities can improve UX when users interact with computing systems, applications, and services.
An increasing number of public datasets have shown a marked clinical impact on assessing anatomical structures. However, each of the datasets is small, partially labeled, and rarely investigates severe tumor subjects. Moreover, current models are limited to segmenting specific organs/tumors, which can not be extended to novel domains and classes. To tackle these limitations, we introduce embedding learned from Contrastive Language-Image Pre-training (CLIP) to segmentation models, dubbed the CLIP-Driven Universal Model. The Universal Model can better segment 25 organs and 6 types of tumors by exploiting the semantic relationship between abdominal structures. The model is developed from an assembly of 14 datasets with 3,410 CT scans and evaluated on 6,162 external CT scans from 3 datasets. We rank first on the public leaderboard of the Medical Segmentation Decathlon (MSD) and achieve the state-of-the-art results on Beyond The Cranial Vault (BTCV). Compared with dataset-specific models, the Universal Model is computationally more efficient (6x faster), generalizes better to CT scans from varying sites, and shows stronger transfer learning performance on novel tasks. The design of CLIP embedding enables the Universal Model to be easily extended to new classes without catastrophically forgetting the previously learned classes.
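The mechanism that makes the Universal Model extensible can be sketched as follows (a toy numpy illustration under assumed shapes, not the paper's architecture): each class's CLIP text embedding acts as a 1x1 convolution filter over pixel features, producing one mask per class, so adding a new class only means appending a new embedding row.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim, h, w = 4, 8, 5, 5

# Hypothetical CLIP text embeddings, one per organ/tumor class.
class_embed = rng.normal(size=(num_classes, dim))
# Dense per-pixel features from a segmentation backbone.
pixel_feat = rng.normal(size=(dim, h, w))

# Class-conditioned segmentation: the text embedding of each class is
# applied as a 1x1 filter over the pixel features, giving per-class logits.
logits = np.einsum("cd,dhw->chw", class_embed, pixel_feat)
masks = 1 / (1 + np.exp(-logits)) > 0.5
assert masks.shape == (num_classes, h, w)
```

Because classes are represented by embeddings rather than fixed output channels, semantically related classes (e.g., an organ and its tumor) get related filters, and new classes can be added without retraining the existing heads.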
Recent advances in self-supervised learning (SSL) in computer vision are primarily comparative, with the goal of preserving invariant and discriminative semantics in latent representations by comparing siamese image views. However, the preserved high-level semantics do not contain enough local information, which is vital in medical image analysis (e.g., image-based diagnosis and tumor segmentation). To mitigate the locality problem of comparative SSL, we propose to incorporate the task of pixel restoration for explicitly encoding more pixel-level information into high-level semantics. We also address the preservation of scale information, a powerful aid to image understanding that has not drawn much attention in SSL. The resulting framework can be formulated as a multi-task optimization problem on the feature pyramid. Specifically, we conduct multi-scale pixel restoration and siamese feature comparison in the pyramid. In addition, we propose non-skip U-Net to build the feature pyramid and develop sub-crop to replace multi-crop in 3D medical imaging. The proposed unified SSL framework (PCRLv2) surpasses its self-supervised counterparts on various tasks, including brain tumor segmentation (BraTS 2018), chest pathology identification (ChestX-ray, CheXpert), pulmonary nodule detection (LUNA), and abdominal organ segmentation (LiTS), sometimes outperforming them by large margins with limited annotations.
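The multi-task formulation can be sketched as a weighted sum of the two objectives (an illustrative toy loss, not PCRLv2's exact objective or weighting): a pixel-restoration term (MSE here) plus a siamese feature-comparison term (negative cosine similarity here).

```python
import numpy as np

def multi_task_loss(restored, target, feat_a, feat_b, w=0.5):
    # Pixel restoration: mean squared error between restored and target image.
    mse = float(np.mean((restored - target) ** 2))
    # Siamese comparison: 1 - cosine similarity between the two view features.
    cos = float(feat_a @ feat_b /
                (np.linalg.norm(feat_a) * np.linalg.norm(feat_b)))
    return w * mse + (1 - w) * (1.0 - cos)

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 4))
feat = rng.normal(size=8)

# Perfect restoration and identical view features give (near-)zero loss.
loss = multi_task_loss(img, img, feat, feat)
assert abs(loss) < 1e-9
```

PCRLv2 applies terms of this kind at multiple scales of the feature pyramid rather than once at a single resolution.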
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
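The parallel decoding that makes Muse faster than autoregressive models can be sketched as follows (a mock with random "confidences" standing in for the model; the token count, schedule, and vocabulary size are all illustrative): at each step the most confident fraction of still-masked positions is committed at once, instead of one token per step.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, steps = 8, 4
tokens = np.full(num_tokens, -1)  # -1 marks a MASK token

for step in range(steps):
    masked = np.flatnonzero(tokens == -1)
    conf = rng.random(masked.size)  # stand-in for model confidence scores
    # Commit roughly an equal share of the remaining masks each step, so
    # all positions are filled after `steps` iterations in parallel.
    k = int(np.ceil(masked.size / (steps - step)))
    commit = masked[np.argsort(conf)[-k:]]
    tokens[commit] = rng.integers(0, 1024, size=commit.size)

assert (tokens != -1).all()
```

Eight tokens decoded in four steps here; a real image grid of thousands of tokens is filled in a similarly small, fixed number of steps, versus one step per token autoregressively.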